fix(cudf): Fix UCX output buffer lifetime after async cuDF concat by kjmph · Pull Request #17533 · facebookincubator/velox

kjmph · 2026-05-15T18:30:50Z

UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them.

This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack:

cudf::concatenate
UcxPartitionedOutput::flushPending

Fix the lifetime ordering without a CPU-blocking synchronize:

make the concat/output stream wait for all input streams before reading input table views;
prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream;
fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound.

This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit.

Validated by:

TPC-H Q18 GPU correctness loop against stored DuckDB reference
compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported

netlify · 2026-05-15T18:30:55Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`4fb8b97`
🔍 Latest deploy log	https://app.netlify.com/projects/meta-velox/deploys/6a231cde6038ea00081f9c81

meta-cla · 2026-05-15T18:30:55Z

Hi @kjmph!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

github-actions · 2026-05-18T01:23:35Z

Build Impact Analysis

Full build recommended. Files outside the dependency graph changed:

velox/experimental/cudf/exec/Utilities.cpp
velox/experimental/cudf/exec/Utilities.h
velox/experimental/cudf/tests/CMakeLists.txt
velox/experimental/cudf/tests/CudfVectorTest.cpp
velox/experimental/cudf/vector/CudfVector.cpp
velox/experimental/cudf/vector/CudfVector.h
velox/experimental/ucx-exchange/UcxPartitionedOutput.cpp

These directories are not fully covered by the dependency graph. A full build is the safest option.

cmake --build _build/release

Slow path • Graph generated from PR branch

github-actions · 2026-05-18T02:20:36Z

CI Failure Analysis

Auto-generated by the CI Failure Analysis workflow. This comment is updated in place each time CI fails on a new commit, so it always reflects the latest run — re-pushing or re-running CI will refresh the analysis below. Last updated 2026-06-05 21:34:35 UTC from workflow run 27034362865.

🔴 Linux release with adapters — BUILD Failure View logs

Build errors:

The build fails compiling the new test file CudfVectorTest.cpp because cudaError (a CUDA enum) is not formattable by fmt. The VELOX_CHECK_EQ macro at line 46 tries to format two cudaError values for the error message, but no fmt::formatter<cudaError> specialization exists.

FAILED: velox/experimental/cudf/tests/CMakeFiles/velox_cudf_vector_test.dir/CudfVectorTest.cpp.o

velox/experimental/cudf/tests/CudfVectorTest.cpp:46:5:   required from here
  269 |     auto message = ::facebook::velox::errorMessage(__VA_ARGS__);
      |                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~

/usr/local/include/fmt/base.h:2262:45: error: 'fmt::v11::detail::type_is_unformattable_for<cudaError, char> _'
                                         has incomplete type
 2262 |     type_is_unformattable_for<T, char_type> _;
      |                                             ^

The offending line in the new test file (CudfVectorTest.cpp:46) is:

VELOX_CHECK_EQ(
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
    cudaSuccess);

VELOX_CHECK_EQ uses fmt::format to render both arguments when the check fails. Since cudaError (the return type of cudaStreamCreateWithFlags) has no fmt formatter, compilation fails with -Werror.

Correlation with PR changes:

This failure is directly caused by this PR. The file velox/experimental/cudf/tests/CudfVectorTest.cpp is a new file added in this PR. The TestCudaStream constructor at line 46 uses VELOX_CHECK_EQ with cudaError values, which triggers the fmt unformattable type error.

Known issues:

No existing open issues track this specific failure.
Recent main branch CI runs (Linux Build using GCC) all pass, confirming this is not a pre-existing issue.

Reproduce locally:

make release  # will fail during compilation of CudfVectorTest.cpp

Recommended fix:

Replace VELOX_CHECK_EQ with CUDF_CUDA_TRY (which is already used elsewhere in the same file), or cast the cudaError values to int so fmt can format them. The simplest fix is to use the same pattern already present in the helper function makeTable():

// Instead of:
VELOX_CHECK_EQ(
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking),
    cudaSuccess);

// Use:
CUDF_CUDA_TRY(cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking));

This is consistent with the CUDF_CUDA_TRY usage on line 140 of the same file and avoids the fmt formatter issue entirely.

meta-cla · 2026-05-27T01:35:47Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

UcxPartitionedOutput::flushPending() synchronizes input streams before launching cudf::concatenate(), but concatenate itself is asynchronous on the output stream. When multiple pending inputs were flushed, the operator could clear pendingInputs_ immediately after launching concat, allowing the input CudfVector buffers to be deallocated before the concat kernels finished reading them. This caused nondeterministic GPU result corruption. A TPC-H Q18 correctness loop against a DuckDB reference reproduced the issue regularly, and compute-sanitizer reported a use-after-free with this stack: cudf::concatenate UcxPartitionedOutput::flushPending Fix the lifetime ordering without a CPU-blocking synchronize: * make the concat/output stream wait for all input streams before reading input table views; * prefer rebinding owned cuDF input buffers to the output stream so their stream-ordered deallocation happens after concat work on that stream; * fall back to a single event recorded on the output stream and waited by all input streams when inputs cannot be rebound. This keeps the existing memory behavior of releasing pending input buffers before partitioning, while making their async lifetime ordering explicit. Validated by: * TPC-H Q18 GPU correctness loop against stored DuckDB reference * compute-sanitizer with stream-ordered race tracking, confirming the original concat use-after-free is no longer reported

Update the cuDF vector deallocation ordering helpers to accept std::span instead of concrete std::vector references. These helpers only need read-only views over CudfVector pointers and CUDA streams.

CudfVector::rebindStream returned early when the vector's logical stream already matched the target stream. That misses packed_table storage, where gpu_data may still have a different deallocation stream after an intra-node UCX handoff. Update packed_table storage by setting the packed buffer deallocation stream directly. For owned cudf::table storage, rely on the current cuDF rebind_stream API and remove the old compatibility shim. Add CudfVector tests that verify both owned table and packed table buffers are deallocated on the rebound stream.

When CudfVector buffers cannot be rebound to the output stream, orderCudfVectorDeallocationsAfterStream falls back to recording an event on the output stream and making the input streams wait on it. Reuse a per-thread, per-device CUDA event for this fallback instead of creating and destroying a new event on each call. Intentionally leak the event to avoid CUDA calls from thread-local destructors after CUDA context teardown, matching libcudf's stream synchronization pattern.

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

Use cuda_async_memory_resource in the CudfVector rebind tests so the instrumented resource records stream-ordered deallocation behavior. cuda_memory_resource ignores the stream and frees via cudaFree, so it only verified the wrapper-observed stream. The async resource makes the test match the lifetime ordering that CudfVector::rebindStream is intended to control.

kjmph requested review from bdice, devavret, karthikeyann and mhaseeb123 as code owners May 15, 2026 18:30

bdice reviewed May 15, 2026

View reviewed changes

Comment thread velox/experimental/cudf/vector/CudfVector.cpp Outdated

Comment thread velox/experimental/cudf/exec/Utilities.cpp Outdated

Comment thread velox/experimental/cudf/vector/CudfVector.cpp

mbasmanova changed the title ~~Fix UCX output buffer lifetime after async cuDF concat~~ fix(cudf): Fix UCX output buffer lifetime after async cuDF concat May 18, 2026

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 27, 2026

GregoryKimball added this to libcudf Jun 2, 2026

karthikeyann force-pushed the fix/experimental-cudf/use-after-free branch from 997def3 to 832f70e Compare June 2, 2026 15:57

karthikeyann requested review from jinchengchenghh and shrshi as code owners June 2, 2026 15:57

karthikeyann mentioned this pull request Jun 2, 2026

Investigate non-deterministic Cuda illegal memory access rapidsai/velox-testing#355

Open

karthikeyann reviewed Jun 2, 2026

View reviewed changes

Comment thread velox/experimental/cudf/vector/CudfVector.cpp

karthikeyann reviewed Jun 2, 2026

View reviewed changes

Comment thread velox/experimental/cudf/exec/Utilities.cpp Outdated

kjmph force-pushed the fix/experimental-cudf/use-after-free branch from 832f70e to a844318 Compare June 4, 2026 14:37

kjmph added 2 commits June 5, 2026 11:32

Use spans for cuDF stream-ordering helpers

a891a35

Update the cuDF vector deallocation ordering helpers to accept std::span instead of concrete std::vector references. These helpers only need read-only views over CudfVector pointers and CUDA streams.

kjmph force-pushed the fix/experimental-cudf/use-after-free branch from a844318 to a891a35 Compare June 5, 2026 15:33

kjmph added 2 commits June 5, 2026 11:35

bdice reviewed Jun 5, 2026

View reviewed changes

Comment thread velox/experimental/cudf/exec/Utilities.cpp

Comment thread velox/experimental/cudf/vector/CudfVector.h Outdated

Comment thread velox/experimental/cudf/tests/CudfVectorTest.cpp Outdated

Comment thread velox/experimental/cudf/tests/CudfVectorTest.cpp Outdated

kjmph and others added 3 commits June 5, 2026 14:45

Update velox/experimental/cudf/vector/CudfVector.h

9bdae55

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

Update velox/experimental/cudf/tests/CudfVectorTest.cpp

5342968

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

bdice approved these changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533

fix(cudf): Fix UCX output buffer lifetime after async cuDF concat#17533
kjmph wants to merge 7 commits into
facebookincubator:mainfrom
kjmph:fix/experimental-cudf/use-after-free

kjmph commented May 15, 2026

Uh oh!

netlify Bot commented May 15, 2026 •

edited

Loading

Uh oh!

meta-cla Bot commented May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 18, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 18, 2026 •

edited

Loading

Uh oh!

meta-cla Bot commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

kjmph commented May 15, 2026

Uh oh!

netlify Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for meta-velox canceled.

Uh oh!

meta-cla Bot commented May 15, 2026

Action Required

Process

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Build Impact Analysis

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Failure Analysis

🔴 Linux release with adapters — BUILD Failure View logs

Uh oh!

meta-cla Bot commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

netlify Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 18, 2026 •

edited

Loading

github-actions Bot commented May 18, 2026 •

edited

Loading